YOLO Object Detection

Object Detection and YOLO

What is Object Detection ?

Object Detection

Classification with localization

Defining the target label y

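Output: the objects present in the image, together with a bounding box

  • pedestrian, 0 or 1;
  • car, 0 or 1;
  • motorcycle, 0 or 1;
  • image background, 0 or 1;
  • bounding box: bx, by, bh, bw

Here bx and by give the midpoint of the car, and bh and bw are the height and width of the bounding box. With the upper-left corner of the image at (0,0) and the lower-right corner at (1,1), all of these values are positions or lengths expressed as fractions of the image size.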
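As a rough illustration of such a target vector, the sketch below builds y in the layout [Pc, bx, by, bh, bw, c1, c2, c3] that these notes use later for YOLO. The function name make_target and the example values are made up for illustration, and filling the "don't care" entries with zeros is an assumption.

```python
def make_target(class_index=None, box=None, num_classes=3):
    """Build y = [Pc, bx, by, bh, bw, c1, ..., cK] for one image.

    class_index -- 0-based class of the object, or None for background only
    box         -- (bx, by, bh, bw) as fractions of the image size
    """
    if class_index is None:
        # Background only: Pc = 0; the remaining entries are "don't care"
        # values, stored as zeros here (an assumption of this sketch).
        return [0.0] * (5 + num_classes)
    bx, by, bh, bw = box
    one_hot = [1.0 if k == class_index else 0.0 for k in range(num_classes)]
    return [1.0, bx, by, bh, bw] + one_hot


# A car (class index 1 of pedestrian/car/motorcycle) centered at (0.5, 0.7)
y_car = make_target(class_index=1, box=(0.5, 0.7, 0.3, 0.4))
y_background = make_target()
```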

Target Label Y

image

Loss Function

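If we use a squared-error loss function:

  • Pc=1: the loss sums the squared error over every component of y.
    In this case we care about how accurately the network predicts all of the output values;

  • Pc=0: the loss is the squared error on Pc alone.
    In this case we only care about how accurately the network predicts the background value.

In practical object localization, a better choice is:

  • use a softmax cross-entropy loss for c1, c2, c3;
  • use squared error or something similar for the four bounding box values;
  • use a logistic regression loss, or squared prediction error, for Pc.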
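The squared-error variant described above can be sketched in a few lines. This is only an illustration under the assumption that y is laid out as [Pc, bx, by, bh, bw, c1, c2, c3]; the name localization_loss is hypothetical.

```python
def localization_loss(y_true, y_pred):
    """Squared-error loss for y = [Pc, bx, by, bh, bw, c1, c2, c3]."""
    if y_true[0] == 1:
        # An object is present: every component of the output matters.
        return sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    # Background only: the other components are "don't care",
    # so only the error on Pc is penalized.
    return (y_pred[0] - y_true[0]) ** 2


object_truth = [1, 0.5, 0.5, 0.2, 0.2, 0, 1, 0]
perfect_loss = localization_loss(object_truth, object_truth)

background_truth = [0, 0, 0, 0, 0, 0, 0, 0]
noisy_prediction = [0.5, 9, 9, 9, 9, 9, 9, 9]   # only Pc = 0.5 is penalized
background_loss = localization_loss(background_truth, noisy_prediction)
```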

Prerequisites

Sliding Windows Detection Algorithm

Weakness : computational cost

Improvement : Convolutional Implementation of Sliding Windows

Turning FC layer into convolutional layers

image

Convolutional Implementation of Sliding Windows

image

IoU (Intersection over union)

More generally, IoU is a measure of the overlap between two bounding boxes.

IoU

Non-max Suppression

Anchor Box

Overlapping Objects

image

Anchor Box Algorithm

With two anchor boxes, each object in a training image is assigned to the grid cell that contains the object's midpoint, and within that cell to the anchor box that has the highest IoU with the object's shape.
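A minimal sketch of the anchor choice, assuming that anchors are compared with the object by shape only (both boxes are treated as if they shared the same center). The helper names shape_iou and best_anchor and the anchor sizes are made up for illustration.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes given as (width, height) and aligned at the same center."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(object_wh, anchors):
    """Index of the anchor whose shape overlaps the object's shape the most."""
    return max(range(len(anchors)), key=lambda i: shape_iou(object_wh, anchors[i]))


# Two hypothetical anchors: a tall one (pedestrian-like) and a wide one (car-like)
anchors = [(0.2, 0.6), (0.6, 0.2)]
pedestrian_anchor = best_anchor((0.25, 0.7), anchors)   # tall object
car_anchor = best_anchor((0.5, 0.25), anchors)          # wide object
```

With these values the tall object is matched to anchor 0 and the wide object to anchor 1.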

Demo

enter image description here

Weakness

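  • If we use two anchor boxes but three objects fall into the same grid cell, the algorithm needs some additional mechanism to handle it;
  • If two objects fall into the same grid cell and match anchor boxes of the same shape, that case also needs special handling.

Both situations are unlikely to occur in practice, so they have little effect on the object detection algorithm.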

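How to choose anchor boxes?

  • Usually the anchor box shapes are specified by hand: pick 5 to 10 of them so that they cover the variety of shapes of the objects we want to detect;

  • A more advanced method (K-means): cluster the shapes of the objects in the data and use the cluster results as a set of representative anchor boxes for the objects we want to detect.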
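The K-means idea can be sketched as follows. This is an illustration only: it assumes that boxes are compared by shape IoU, that each cluster center is updated as the plain mean of its members' width and height, and that initial anchors are supplied by the caller. The names shape_iou and kmeans_anchors and the sample shapes are hypothetical.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes given as (width, height) and aligned at the same center."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def kmeans_anchors(box_shapes, initial_anchors, iterations=10):
    """Cluster (width, height) pairs; each cluster mean becomes an anchor."""
    anchors = list(initial_anchors)
    for _ in range(iterations):
        # Assignment step: each box joins the anchor it overlaps the most.
        clusters = [[] for _ in anchors]
        for wh in box_shapes:
            nearest = max(range(len(anchors)),
                          key=lambda i: shape_iou(wh, anchors[i]))
            clusters[nearest].append(wh)
        # Update step: move each anchor to the mean shape of its cluster.
        updated = []
        for anchor, members in zip(anchors, clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(anchor)   # keep an anchor that attracted no boxes
        anchors = updated
    return anchors


shapes = [(0.2, 0.6), (0.25, 0.7), (0.6, 0.2), (0.7, 0.3)]
anchors = kmeans_anchors(shapes, initial_anchors=[(0.1, 0.5), (0.5, 0.1)])
```

On this toy data the two tall shapes and the two wide shapes end up in separate clusters, giving one tall and one wide anchor.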

YOLO

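  • Divide the image into an n×n grid of cells;

  • Apply the image classification and localization algorithm to each of the n×n grid cells;

  • For each grid cell, output a label in a predefined format: yi=[Pc bx by bh bw c1 c2 c3]. Each grid cell i has its own label vector yi;

  • Stack the labels of the n×n cells together, so the final target output Y has size n×n×8 (8 because each label in this example has 8 values).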

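YOLO notes:

  • An object is assigned to a grid cell by looking at its midpoint: the object belongs to the cell that contains its midpoint. Even if the object spans several cells, it is assigned only to the cell containing the midpoint, and the other cells are labeled "0" (no object);
  • YOLO outputs bounding boxes explicitly, so boxes can have any aspect ratio and their coordinates are more precise, since they are not limited by the stride of a sliding window;
  • YOLO is a single convolutional implementation: it does not run the algorithm n^2 times on the n×n grid but computes all cells in one pass. This makes it efficient and fast enough for real-time detection.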
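The midpoint rule in the first note can be sketched as below. The function name assign_to_cell is hypothetical, and returning the midpoint relative to the cell's upper-left corner is an assumption of this sketch rather than something stated in these notes.

```python
def assign_to_cell(bx, by, grid_size):
    """Find the grid cell that contains an object's midpoint.

    bx, by    -- midpoint as fractions of the image width and height, in [0, 1]
    grid_size -- n for an n x n grid
    Returns (row, col, cell_x, cell_y), where cell_x and cell_y locate the
    midpoint relative to the upper-left corner of the cell.
    """
    # A midpoint exactly on the right or bottom edge goes to the last cell.
    col = min(int(bx * grid_size), grid_size - 1)
    row = min(int(by * grid_size), grid_size - 1)
    cell_x = bx * grid_size - col
    cell_y = by * grid_size - row
    return row, col, cell_x, cell_y


# An object centered at (0.5, 0.9) on a 3x3 grid lands in the bottom-middle cell
row, col, cell_x, cell_y = assign_to_cell(0.5, 0.9, grid_size=3)
```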

Training Set

  • Input X: full images, all of the same size;
  • Target Y: with a 3×3 grid, the output size is 3×3×16 (assuming 2 anchor boxes);
  • For each grid cell, define the target output vector y.
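To make the 3×3×16 shape concrete, here is a small NumPy sketch that fills in one object and then merges the anchor and label dimensions. The cell, anchor index and box values are made up, and the ordering "all values of anchor 0, then all values of anchor 1" is an assumption.

```python
import numpy as np

GRID, NUM_ANCHORS, NUM_CLASSES = 3, 2, 3
LABEL_SIZE = 5 + NUM_CLASSES            # [Pc, bx, by, bh, bw, c1, c2, c3]

# One label vector per (row, col, anchor); all zeros means "no object".
y = np.zeros((GRID, GRID, NUM_ANCHORS, LABEL_SIZE))

# A car (class index 1) whose midpoint falls in row 2, col 1 and whose
# shape matches anchor 1 best.
y[2, 1, 1] = [1, 0.5, 0.7, 0.3, 0.9, 0, 1, 0]

# Merge the anchor and label dimensions: 2 anchors x 8 values = 16.
Y = y.reshape(GRID, GRID, NUM_ANCHORS * LABEL_SIZE)
```

After the reshape, Y[2, 1, :8] holds the (empty) label of anchor 0 and Y[2, 1, 8:] holds the car's label for anchor 1.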

Prediction

Feed in an image of the same size as those in the training set. The network produces the outputs for all grid cells at once, with total size 3×3×16.

Non-max Suppression

Suppose we use 2 anchor boxes. Then each grid cell yields 2 predicted bounding boxes, one of which has a comparatively high Pc.

Discard the predicted bounding boxes whose probability Pc is low.

For each class (pedestrian, car, motorcycle), run NMS separately to obtain the final predicted bounding boxes.

YOLO Practice for Car Detection

Problem Statement

Here’s an example of what your bounding boxes look like:

[Figure: example bounding boxes]

If you have 80 classes that you want YOLO to recognize, you can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. The video lectures had used the latter representation; in this notebook, we will use both representations, depending on which is more convenient for a particular step.

Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for you to use.

YOLO Model Details

YOLO (“you only look once”) is a popular algorithm because it achieves high accuracy while also being able to run in real time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

First things to know:

  • The input is a batch of images of shape (m, 608, 608, 3)
  • The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc,bx,by,bh,bw,c) as explained above. If you expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So you can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let’s look in greater detail at what this encoding represents.

Encoding architecture for YOLO

If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two dimensions of the shape (19, 19, 5, 85) encoding. So the output of the Deep CNN is (19, 19, 425).

Flattening the last two dimensions
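The flattening step amounts to a reshape. The NumPy sketch below uses dummy data instead of real network output, and shows where the 85 numbers of a given box end up after flattening.

```python
import numpy as np

# Dummy encoding with the same shape as the YOLO output for one image
encoding = np.arange(19 * 19 * 5 * 85, dtype=np.float32).reshape(19, 19, 5, 85)

# Flatten the last two dimensions: 5 boxes x 85 numbers = 425
flat = encoding.reshape(19, 19, 5 * 85)

# Box b of cell (i, j) occupies flat[i, j, b * 85:(b + 1) * 85]
i, j, b = 3, 4, 2
same_box = np.array_equal(flat[i, j, b * 85:(b + 1) * 85], encoding[i, j, b])
```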

Now, for each box (of each cell) we will compute the following elementwise product and extract a probability that the box contains a certain class.

Find the class detected by each box
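For a single box, the elementwise product described above can be sketched with NumPy as follows. The confidence and class probabilities are made-up numbers, not output from the real model.

```python
import numpy as np

p_c = 0.60                       # confidence that the box contains some object
class_probs = np.zeros(80)       # c1 ... c80 for this box
class_probs[2] = 0.73            # hypothetical: class index 2 is the most likely
class_probs[0] = 0.12

# Score of each class = p_c * c_k, computed for all 80 classes at once
scores = p_c * class_probs

best_class = int(np.argmax(scores))       # index of the detected class
best_score = float(scores[best_class])    # score assigned to this box
```

Here the box is assigned class index 2 with a score of about 0.44 (0.60 × 0.73).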

Here’s one way to visualize what YOLO is predicting on an image:

  • For each of the 19x19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
  • Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:

Each of the 19x19 grid cells colored according to which class has the largest predicted probability in that cell

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:

[Figure: bounding boxes output by YOLO]

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, you’ll use non-max suppression. Specifically, you’ll carry out these steps:

  • Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class)
  • Select only one box when several boxes overlap with each other and detect the same object.

Filtering with a threshold on class scores

# GRADED FUNCTION: yolo_filter_boxes

import tensorflow as tf          # TensorFlow 1.x style API (later code calls K.get_session())
from keras import backend as K

def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold=.6):
    """Filters YOLO boxes by thresholding on object and class confidence.

    Arguments:
    box_confidence -- tensor of shape (19, 19, 5, 1)
    boxes -- tensor of shape (19, 19, 5, 4)
    box_class_probs -- tensor of shape (19, 19, 5, 80)
    threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box

    Returns:
    scores -- tensor of shape (None,), containing the class probability score for selected boxes
    boxes -- tensor of shape (None, 4), containing (b_x, b_y, b_h, b_w) coordinates of selected boxes
    classes -- tensor of shape (None,), containing the index of the class detected by the selected boxes

    Note: "None" is here because you don't know the exact number of selected boxes, as it depends on the threshold.
    For example, the actual output size of scores would be (10,) if there are 10 boxes.
    """

    # Step 1: Compute box scores (broadcasts the confidence over the 80 classes)
    ### START CODE HERE ### (≈ 1 line)
    box_scores = box_confidence * box_class_probs
    ### END CODE HERE ###

    # Step 2: Find the box_classes thanks to the max box_scores, keep track of the corresponding score
    ### START CODE HERE ### (≈ 2 lines)
    box_classes = K.argmax(box_scores, axis=-1)
    box_class_scores = K.max(box_scores, axis=-1, keepdims=False)
    ### END CODE HERE ###

    # Step 3: Create a filtering mask based on "box_class_scores" by using "threshold". The mask should have the
    # same dimension as box_class_scores, and be True for the boxes you want to keep (with probability >= threshold)
    ### START CODE HERE ### (≈ 1 line)
    filtering_mask = box_class_scores >= threshold
    ### END CODE HERE ###

    # Step 4: Apply the mask to scores, boxes and classes
    ### START CODE HERE ### (≈ 3 lines)
    scores = tf.boolean_mask(box_class_scores, filtering_mask)
    boxes = tf.boolean_mask(boxes, filtering_mask)
    classes = tf.boolean_mask(box_classes, filtering_mask)
    ### END CODE HERE ###

    return scores, boxes, classes
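To experiment with the same thresholding logic without a TensorFlow session, a NumPy version might look like the sketch below. The function name filter_boxes_np is hypothetical, and the inputs are random numbers rather than real model output.

```python
import numpy as np


def filter_boxes_np(box_confidence, boxes, box_class_probs, threshold=0.6):
    """NumPy counterpart of the score-threshold filter (illustrative only)."""
    box_scores = box_confidence * box_class_probs        # (19, 19, 5, 80)
    box_classes = np.argmax(box_scores, axis=-1)         # (19, 19, 5)
    box_class_scores = np.max(box_scores, axis=-1)       # (19, 19, 5)
    mask = box_class_scores >= threshold                 # True for boxes to keep
    # Boolean indexing flattens the three leading dimensions into one.
    return box_class_scores[mask], boxes[mask], box_classes[mask]


rng = np.random.default_rng(0)
confidence = rng.random((19, 19, 5, 1))
corners = rng.random((19, 19, 5, 4))
class_probs = rng.random((19, 19, 5, 80))

scores, kept_boxes, classes = filter_boxes_np(confidence, corners, class_probs, 0.6)
```

The number of surviving boxes depends on the threshold, which is why the corresponding tensor shapes above are written with None.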

Non-max suppression

Even after filtering by thresholding over the class scores, you still end up with a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).

Non-max suppression

Non-max suppression uses the very important function called “Intersection over Union”, or IoU.

Definition of “Intersection over Union”

Hints :

  • In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.
  • To calculate the area of a rectangle you need to multiply its height (y2 - y1) by its width (x2 - x1)
  • You’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
    • xi1 = maximum of the x1 coordinates of the two boxes
    • yi1 = maximum of the y1 coordinates of the two boxes
    • xi2 = minimum of the x2 coordinates of the two boxes
    • yi2 = minimum of the y2 coordinates of the two boxes

# GRADED FUNCTION: iou

def iou(box1, box2):
    """Implement the intersection over union (IoU) between box1 and box2

    Arguments:
    box1 -- first box, list object with coordinates (x1, y1, x2, y2)
    box2 -- second box, list object with coordinates (x1, y1, x2, y2)
    """

    # Calculate the (x1, y1, x2, y2) coordinates of the intersection of box1 and box2. Calculate its Area.
    ### START CODE HERE ### (≈ 5 lines)
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    # Clamp at 0 so that boxes that do not overlap get an intersection area of 0
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)
    ### END CODE HERE ###

    # Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)
    ### START CODE HERE ### (≈ 3 lines)
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area
    ### END CODE HERE ###

    # compute the IoU
    ### START CODE HERE ### (≈ 1 line)
    iou_value = inter_area / union_area
    ### END CODE HERE ###

    return iou_value

You are now ready to implement non-max suppression. The key steps are:

  1. Select the box that has the highest score.
  2. Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
  3. Go back to step 1 and iterate until there are no more boxes with a lower score than the currently selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.
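The three steps can be sketched in plain Python as below. This is a simplified illustration, not the TensorFlow implementation: boxes are given as corners (x1, y1, x2, y2), there is no limit on the number of kept boxes, and the helper names and sample boxes are made up.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as corners (x1, y1, x2, y2)."""
    xi1, yi1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xi2, yi2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)   # 0 when the boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive NMS, best score first."""
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while remaining:
        best = remaining.pop(0)          # step 1: highest-scoring box left
        keep.append(best)
        # step 2: drop every box that overlaps the selected box too much
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) <= iou_threshold]
        # step 3: loop again with whatever is left
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box does not overlap anything and is kept.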

TensorFlow and Keras provide two built-in functions that are used to implement non-max suppression (so you don’t actually need to use your iou() implementation): tf.image.non_max_suppression() and K.gather().

# GRADED FUNCTION: yolo_non_max_suppression

def yolo_non_max_suppression(scores, boxes, classes, max_boxes=10, iou_threshold=0.5):
    """
    Applies Non-max suppression (NMS) to set of boxes

    Arguments:
    scores -- tensor of shape (None,), output of yolo_filter_boxes()
    boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
    classes -- tensor of shape (None,), output of yolo_filter_boxes()
    max_boxes -- integer, maximum number of predicted boxes you'd like
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

    Returns:
    scores -- tensor of shape (None,), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box

    Note: The "None" dimension of the output tensors is at most max_boxes.
    """

    max_boxes_tensor = K.variable(max_boxes, dtype='int32')  # tensor to be used in tf.image.non_max_suppression()
    K.get_session().run(tf.variables_initializer([max_boxes_tensor]))  # initialize variable max_boxes_tensor

    # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep
    ### START CODE HERE ### (≈ 1 line)
    nms_indices = tf.image.non_max_suppression(boxes, scores, max_boxes_tensor, iou_threshold)
    ### END CODE HERE ###

    # Use K.gather() to select only nms_indices from scores, boxes and classes
    ### START CODE HERE ### (≈ 3 lines)
    scores = K.gather(scores, nms_indices)
    boxes = K.gather(boxes, nms_indices)
    classes = K.gather(classes, nms_indices)
    ### END CODE HERE ###

    return scores, boxes, classes

Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions you’ve just implemented.

Exercise: Implement yolo_eval() which takes the output of the YOLO encoding and filters the boxes using score threshold and NMS. There’s just one last implementation detail you have to know. There are a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which we have provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh)

This converts the YOLO box coordinates (x, y, w, h) to box corner coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes.
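The provided yolo_boxes_to_corners works on tensors and its code is not shown here. As a rough single-box illustration of the conversion itself, assuming the corner ordering (x1, y1, x2, y2) named above (the real helper may order its outputs differently), a hypothetical to_corners function could look like this:

```python
def to_corners(x, y, w, h):
    """Convert a box from midpoint/size form to corner form.

    (x, y) is the midpoint, w the width and h the height.
    Returns (x1, y1, x2, y2): the upper-left and lower-right corners.
    """
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


# A box centered in the image, 0.2 wide and 0.4 tall (normalized units)
x1, y1, x2, y2 = to_corners(0.5, 0.5, 0.2, 0.4)
```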

boxes = scale_boxes(boxes, image_shape)

YOLO’s network was trained to run on 608x608 images. If you are testing this data on a different size image–for example, the car detection dataset had 720x1280 images–this step rescales the boxes so that they can be plotted on top of the original 720x1280 image.
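The rescaling idea can be illustrated for one box as follows. This is a sketch under the assumption that box corners are normalized to [0, 1] and ordered (x1, y1, x2, y2); the provided scale_boxes operates on tensors and may use a different coordinate order. The name scale_box is hypothetical.

```python
def scale_box(box, image_shape):
    """Rescale a normalized corner box to pixel coordinates.

    box         -- (x1, y1, x2, y2) with every value in [0, 1]
    image_shape -- (height, width) of the original image in pixels
    """
    height, width = image_shape
    x1, y1, x2, y2 = box
    # x coordinates scale with the width, y coordinates with the height
    return (x1 * width, y1 * height, x2 * width, y2 * height)


# A normalized box mapped onto a 720x1280 image
pixel_box = scale_box((0.25, 0.5, 0.75, 1.0), image_shape=(720.0, 1280.0))
```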

# GRADED FUNCTION: yolo_eval

def yolo_eval(yolo_outputs, image_shape=(720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
    """
    Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.

    Arguments:
    yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
                    box_confidence: tensor of shape (None, 19, 19, 5, 1)
                    box_xy: tensor of shape (None, 19, 19, 5, 2)
                    box_wh: tensor of shape (None, 19, 19, 5, 2)
                    box_class_probs: tensor of shape (None, 19, 19, 5, 80)
    image_shape -- tensor of shape (2,) containing the shape of the original image that the boxes are rescaled to, (720., 1280.) by default (has to be float32 dtype)
    max_boxes -- integer, maximum number of predicted boxes you'd like
    score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

    Returns:
    scores -- tensor of shape (None,), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box
    """

    ### START CODE HERE ###

    # Retrieve outputs of the YOLO model (≈1 line)
    box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs

    # Convert boxes to be ready for filtering functions
    boxes = yolo_boxes_to_corners(box_xy, box_wh)

    # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold (≈1 line)
    scores, boxes, classes = yolo_filter_boxes(box_confidence, boxes, box_class_probs, score_threshold)

    # Scale boxes back to original image shape.
    boxes = scale_boxes(boxes, image_shape)

    # Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold (≈1 line)
    scores, boxes, classes = yolo_non_max_suppression(scores, boxes, classes, max_boxes, iou_threshold)

    ### END CODE HERE ###

    return scores, boxes, classes

Reference